Combine the conversion of UTF-8 to bytes into a single base function #22703

khwilliamson · 2024-10-26T14:18:05Z

Two functions currently do this, utf8_to_bytes and bytes_from_utf8. This combines the guts of their implementations.

It also creates 3 new public macros that call the base function with a somewhat different API that is more convenient to use, and that can avoid mallocs when they are not needed.

The first three commits in this series are needed as a base for the rest, but are already in individual pull requests.

This set of changes requires a perldelta entry, and it is not yet included.

Prior to this commit these were illegal. This causes embed.fnc to generate macro 'Perl_foo' #defined to be macro 'foo'. This could be used to easily convert existing macros into having long names should some become a name space pollution problem. This also documents in embed.fnc, under the 'm' flag discussion, how to use this instead of placing things in mathoms.c, or creating stub functions.

The previous version did not ensure in all cases that it wasn't reading beyond the end of the buffer. In addition, the first pass through the input string has already ruled out most problems, so the full generality of the macro UTF8_IS_DOWNGRADEABLE_START isn't needed here, which simplifies things.

This makes the function able either to overwrite the input or to create new memory for the result. It changes bytes_from_utf8() to use this new capability instead of nearly duplicating the core code of this function. Prior to this commit, bytes_from_utf8() just allocated memory the size of the original string and started copying into it. When it came to a sequence that wasn't convertible, it stopped and freed the copy. The new behavior checks that the string is convertible first, before the malloc. The advantage is that there is no malloc without being sure it will be useful; the disadvantage is an extra pass through the input string, though that pass is per-word. The next commit will introduce another advantage.
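
The check-before-malloc behavior described above can be sketched roughly as follows. This is a standalone illustration, not the code in utf8.c: the names `is_downgradeable` and `downgrade` are hypothetical, the check here is per-byte rather than per-word, and it assumes an ASCII platform, where code points 0x80..0xFF are encoded as a C2 or C3 start byte plus one continuation byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: true if s[0..len) is UTF-8 containing only code
 * points below 256.  Assumes an ASCII platform. */
static bool is_downgradeable(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        if (s[i] < 0x80) {          /* invariant: same byte either way */
            i++;
            continue;
        }
        /* Only start bytes C2 and C3 encode code points 0x80..0xFF */
        if ((s[i] != 0xC2 && s[i] != 0xC3)
            || i + 1 >= len                     /* never read past the end */
            || (s[i + 1] & 0xC0) != 0x80)
            return false;
        i += 2;
    }
    return true;
}

/* Hypothetical: convert UTF-8 to native bytes, either overwriting the
 * input (in_place) or returning a newly malloc'd buffer.  Returns NULL,
 * with the input untouched and nothing allocated, if not convertible. */
static unsigned char *downgrade(unsigned char *s, size_t *lenp, bool in_place)
{
    if (!is_downgradeable(s, *lenp))
        return NULL;                /* decided before any malloc */

    unsigned char *dest = in_place ? s : malloc(*lenp + 1);
    if (dest == NULL)
        return NULL;

    size_t in = 0, out = 0;
    while (in < *lenp) {
        if (s[in] < 0x80) {
            dest[out++] = s[in++];
        } else {
            /* out <= in always holds, so in-place writes never
             * clobber bytes that have not been read yet */
            dest[out++] = (unsigned char)(((s[in] & 0x03) << 6)
                                          | (s[in + 1] & 0x3F));
            in += 2;
        }
    }
    if (!in_place || out < *lenp)   /* terminate only where there is room */
        dest[out] = '\0';
    *lenp = out;
    return dest;
}
```

Because the output is never longer than the input, the same loop serves both modes; only the destination pointer differs.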

Prior to this commit, the size malloced was the same as the length of the input string, which is the worst-case scenario. This commit changes things so that the new pass through the input (introduced in the previous commit) also calculates the needed length. The additional cost of doing this is minimal, and it pays off on a very long string with lots of convertible sequences.
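
As a rough illustration of folding the length calculation into the checking pass, here is a minimal standalone sketch. The function name `downgraded_length` is hypothetical, the loop is per-byte rather than per-word, and an ASCII platform is assumed; the real code in utf8.c may differ in all of these respects.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: in a single pass, check that s[0..len) is convertible to
 * native bytes and compute how many bytes the result needs.
 * Returns false (leaving *out_len untouched) if not convertible. */
static bool downgraded_length(const unsigned char *s, size_t len,
                              size_t *out_len)
{
    size_t needed = 0;

    /* Each successful iteration consumes one character, which becomes
     * exactly one output byte. */
    for (size_t i = 0; i < len; needed++) {
        if (s[i] < 0x80) {
            i++;                                /* 1 byte in, 1 byte out */
        } else if ((s[i] == 0xC2 || s[i] == 0xC3)
                   && i + 1 < len
                   && (s[i + 1] & 0xC0) == 0x80) {
            i += 2;                             /* 2 bytes in, 1 byte out */
        } else {
            return false;
        }
    }

    *out_len = needed;
    return true;
}
```

A caller could then allocate `needed + 1` bytes instead of `len + 1`. For a string made up entirely of two-byte sequences, that halves the allocation.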

This is a non-destructive conversion of the input into native bytes, with any new memory required marked for later freeing via SAVEFREEPV. The caller thus does not have to be concerned at all with whether memory was allocated. A new macro is created that calls this internal function with the correct parameter to force this behavior.

khwilliamson · 2024-10-28T19:52:08Z

This pull request is going forward instead via #22638.

khwilliamson added 18 commits October 26, 2024 06:49

embed.fnc: Convert some tabs to blanks

7c1a674

Remove unused internal function bytes_from_utf8_loc

24af9ac

Commit e4d3d0c removed all the calls to this function.

utf8.c: Move declaration to first use

fe0f423

utf8.c: White-space only

5057e3f

Outdent and reflow some comments and code in preparation for moving them out of the loop.

utf8_to_bytes() Move failure code out of loop

7c5c6df

This is for clarity. All this very-unlikely-to-be-used code was in the middle of what is really going on, creating a distraction.

utf8_to_bytes: Update and fix comments.

4689c3a

These were misleading. On ASCII platforms, many calls to this function won't use the per-word algorithm. That's only done for long-enough strings.

utf8_to_bytes: Rename variable

cb9b819

The new name, s0, is used in more places elsewhere for this meaning, and is more descriptive.

Add preliminary utf8_to_bytes_()

695990e

This is an internal function, designed to be an extension of utf8_to_bytes(), with a slightly different API. This commit just adds it and calls it only from utf8_to_bytes. Future commits will extend this API.

utf8_to_bytes_: Return an enum instead of bool

d89a5d3

This adds a third return possibility to this new function. Its sole caller still treats it as a boolean for now.
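
The pattern of a three-valued return whose caller still treats it as a boolean might look something like the sketch below. The commit message does not say what the third value means, so the enum names and their meanings here are purely assumptions for illustration, as are the function names; none of them are taken from utf8.c. An ASCII platform is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type; the real enum's names and meanings are not
 * given in the commit message. */
typedef enum {
    DOWNGRADE_FAILED,       /* assumed: contains a code point above 255 */
    DOWNGRADE_UNCHANGED,    /* assumed: only invariants, nothing to do  */
    DOWNGRADE_CONVERTIBLE   /* assumed: has sequences that need folding */
} downgrade_result;

/* Hypothetical classifier over s[0..len). */
static downgrade_result classify(const unsigned char *s, size_t len)
{
    downgrade_result result = DOWNGRADE_UNCHANGED;
    size_t i = 0;

    while (i < len) {
        if (s[i] < 0x80) {
            i++;
            continue;
        }
        if ((s[i] != 0xC2 && s[i] != 0xC3)
            || i + 1 >= len
            || (s[i + 1] & 0xC0) != 0x80)
            return DOWNGRADE_FAILED;
        result = DOWNGRADE_CONVERTIBLE;
        i += 2;
    }
    return result;
}

/* A caller that, for now, collapses the result to a boolean. */
static bool can_downgrade(const unsigned char *s, size_t len)
{
    return classify(s, len) != DOWNGRADE_FAILED;
}
```

Existing callers keep working through the boolean wrapper, while later callers can distinguish the extra case.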

utf8_to_bytes_opts: Add const

cbd80a8

This variable should not be changed by the function.

utf8_to_bytes_: Add argument, macro

d0fe511

The argument is currently unused. The macro is a public-facing API that calls this function with the correct argument.

utf8_to_bytes_: Slight refactor

34a71e0

This makes the next commit smaller.

Document new utf8_to_bytes() variants

ee373c8

khwilliamson force-pushed the utf8_to_bytes_opts branch from b99cd7d to ee373c8 October 26, 2024 14:23

github-actions bot added the hasConflicts label Oct 28, 2024

khwilliamson closed this Oct 28, 2024

khwilliamson mentioned this pull request Oct 28, 2024

refcounted_he_(new|fetch)_pvn: Don't roll-own code #22638

Open
